Abstract

Basic to many psychological investigations is the question of
agreement between observers who independently categorize people.
Several recent studies have proposed measures of agreement when a
set of nominal scale categories has been pre-defined and imposed
on both observers. This study, in contrast, develops a measure of
agreement for settings where observers independently define their
own categories. Thus, it is possible for observers to delineate
different numbers of categories, with different names.
Computational formulae for the mean and variance of the proposed agreement
measure are given; further, a statistic with a large-sample normal
distribution is suggested for testing the null hypothesis of random
agreement. A computer-based comparison of the large-sample
approximation with the exact distribution of the test statistic shows a
generally good fit, even for moderate sample sizes. Finally, a
worked example involving two psychologists' classifications of
children illustrates the computations.

Many variations of the problem of measuring agreement between
two or more observers have been investigated by psychologists and
statisticians. When measurements are taken on a variable with a
continuous metric, agreement is generally expressed as a
reliability or generalizability coefficient. As discussed thoroughly
by Cronbach et al. (1972), these coefficients are usually some
version of the well-known intraclass correlation.
Suppose, on the other hand, that two psychologists each
independently distribute N people among a set of mutually exclusive
categories. When categories are specified in advance, Cohen (1960)
suggested a measure of agreement, Kappa, for two observers who each
assign N people among these categories. This measure has been
extended by Cohen (1968), Everitt (1968), and Fleiss, Cohen, and
Everitt (1969). It was further extended to three or more observers
by Light (1971) and Fleiss (1971). For measuring agreement among
several observers when each person is scored dichotomously, Fleiss
(1965) has suggested procedures that are basically combinatorial.
All of the suggestions have led to a useful and impressive set of
procedures. All of these procedures, however, begin with the
assumption that the categories, each with a specific name, have
been preselected, and that observers distribute people among these
categories.

The problem we consider here is somewhat different. Suppose
two psychologists are asked to partition a group of people into
several subgroups. The specific criteria for partitioning are left
up to each psychologist. Thus the two psychologists may develop
different numbers of subgroups. Moreover, since no precise set
of subgroups has been labeled in advance, each psychologist may
use different criteria, resulting in categories with different
labels. This situation is illustrated in Table 1, where two
psychologists, after studying a group of children, independently
categorized each of 15 children into one of three subgroups based
on behavior patterns.

Table 1 about here 

An important question to be asked of these data is, "Do the
two observers' lists agree beyond chance?" In the following
sections we examine this question. We begin by developing a measure
of agreement that provides a basis for testing the null hypothesis
of random agreement. In order to examine the behavior of this
measure, we present computational formulae for its mean and
variance, which are then incorporated into a large-sample test
statistic. Finally, after investigating the appropriateness of
the test statistic for moderate sample sizes, we apply our procedure
to the data in Table 1.

Developing a Measure of Agreement 

Viewing our data in the format of a two-dimensional
contingency table will help to clarify the definition of agreement. The
raw data in Table 1 are an example of data that can always be
displayed in an R x C table, as illustrated in Table 2. Notice that
the first observer's categories are indexed by a_i, i = 1,...,R,
while the second observer's categories are b_j, j = 1,...,C. Note
also that this format differs from that of Cohen (1960), Light
(1971), and Fleiss (1971) in that here R can be different from C,
and the row categories are not necessarily the same as the column
categories. In Table 2, n_{ij} represents the number of persons
classified into category a_i by observer 1, and into category b_j by
observer 2. Finally,

n_{i+} = \sum_{j=1}^{C} n_{ij}, i = 1,...,R; and n_{+j} = \sum_{i=1}^{R} n_{ij}, j = 1,...,C.

Table 2 about here 



How can agreement be measured from this table? The strategy
is to study all possible pairs of children, and classify each pair
as an agreement or disagreement in the following way. Let us
focus on a particular pair of children, Adam and Bonnie. Let
this pair constitute an agreement if:

a. Observer 1 classifies Adam and Bonnie into the same
   category, say a_i, and Observer 2 classifies Adam and Bonnie
   into the same category, say b_j, or

b. Observer 1 classifies Adam and Bonnie into different
   categories, and Observer 2 classifies Adam and Bonnie into
   different categories.

Any other situation constitutes a disagreement, as summarized in
Table 3.
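The pair-by-pair definition can be checked by direct enumeration. The following is a minimal Python sketch, not the authors' program: the function name `count_agreements` and the short category labels are illustrative assumptions, and the two label lists encode the Table 1 assignments for children A through O.

```python
from itertools import combinations

def count_agreements(labels1, labels2):
    """Count pairs that both observers keep together or both keep apart."""
    agreements = 0
    for p, q in combinations(range(len(labels1)), 2):
        together1 = labels1[p] == labels1[q]
        together2 = labels2[p] == labels2[q]
        if together1 == together2:  # cases (a) and (b) of the definition
            agreements += 1
    return agreements

# Table 1 assignments, children in the order A, B, ..., O.
# The short labels are hypothetical; only equality of labels matters.
children = "ABCDEFGHIJKLMNO"
observer1 = dict(zip(children, ["Ath"] * 5 + ["Pop"] * 5 + ["Sch"] * 5))
observer2 = {**dict.fromkeys("ABCDF", "Pop"),
             **dict.fromkeys("GKLMN", "Ath"),
             **dict.fromkeys("EHIJO", "None")}
labels1 = [observer1[c] for c in children]
labels2 = [observer2[c] for c in children]

print(count_agreements(labels1, labels2))  # 75 of the 105 possible pairs
```

Because only equality of labels within each observer's list is compared, the names the two observers give their categories never need to match.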



Table 3 about here

Given these definitions, the concepts of agreement and
disagreement have a relatively simple interpretation in terms of the
cell entries in Table 2. If the two persons in any given pair are
in the same cell, or they are in neither the same row nor the same
column, then that pair constitutes an "agreement". Using this
idea, and remembering that any table will have \binom{N}{2} possible pairs,
the total number of observed agreements in any table will reduce to:

(1)  A' = \binom{N}{2} + \sum_{i=1}^{R} \sum_{j=1}^{C} n_{ij}^2 - \frac{1}{2} \left( \sum_{i=1}^{R} n_{i+}^2 + \sum_{j=1}^{C} n_{+j}^2 \right)
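The closed form in (1) needs only the cell counts and the marginal totals. A minimal Python sketch follows; the function name `agreements_from_table` is an illustrative assumption, and the 3 x 3 counts are those of the worked example (Table 6).

```python
from math import comb

def agreements_from_table(table):
    """A' = C(N,2) + sum n_ij^2 - (sum n_i+^2 + sum n_+j^2) / 2, as in (1)."""
    n = sum(sum(row) for row in table)
    cell_squares = sum(x * x for row in table for x in row)
    row_squares = sum(sum(row) ** 2 for row in table)
    col_squares = sum(sum(col) ** 2 for col in zip(*table))
    # The two marginal sums of squares always add to an even number,
    # so integer division loses nothing.
    return comb(n, 2) + cell_squares - (row_squares + col_squares) // 2

# Cell counts of the worked example (rows: observer 1, columns: observer 2).
table6 = [[4, 0, 1],
          [1, 1, 3],
          [0, 4, 1]]
print(agreements_from_table(table6))  # 75
```

The same function accepts rectangular tables, since nothing in (1) requires R = C.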

The Expected Number of Agreements

For any two observers we wish to examine the observed number
of agreements from (1), and compare this number to the number
expected from "chance" agreement. Thus, we test the null hypothesis
H_0: A = E(A') against the one-tailed alternative hypothesis
H_1: A > E(A'), where A is the population parameter. Let us now
turn to the development of E(A').

The expected number of agreements for any observed set of data
depends upon the cell entries, n_{ij}, of a table such as Table 2.
But the distribution of the n_{ij} depends upon whether the marginal
totals of the table are assumed to be fixed (hypergeometric model)
or variable (multinomial model). We will take the marginals to be
fixed, although Kendall and Stuart (1967) point out that for large
samples both assumptions lead to the same large-sample distribution
for cell entries.

Therefore, to find the expected number of agreements in any
given table, we take the expected value of A' from (1). This
becomes:

(2)  E(A') = \binom{N}{2} - \frac{1}{2} \left( \sum_{i} n_{i+}^2 + \sum_{j} n_{+j}^2 \right) + \sum_{i} \sum_{j} \mu_{i,j}^{[1]} + \sum_{i} \sum_{j} \mu_{i,j}^{[2]}

In equation (2),

(3)  \mu_{i,j}^{[r]} = E[ n_{ij} (n_{ij} - 1) \cdots (n_{ij} - r + 1) ] = \frac{ n_{i+}^{[r]} n_{+j}^{[r]} }{ N^{[r]} }

where, in general, n^{[r]} = n(n-1) \cdots (n-r+1) (Kendall and Stuart,
1967).
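As a hedged illustration of (2) and (3), the sketch below computes the factorial moments of a cell count under fixed margins and sums them into E(A'). The helper names `falling`, `mu`, and `expected_agreements` are assumptions made for this example; the closed form used for the r-th factorial moment is the standard hypergeometric result n_{i+}^{[r]} n_{+j}^{[r]} / N^{[r]}.

```python
from math import comb

def falling(n, r):
    """Falling factorial n^[r] = n (n-1) ... (n-r+1)."""
    out = 1
    for k in range(r):
        out *= n - k
    return out

def mu(row_total, col_total, n, r):
    """r-th factorial moment of a cell count under fixed margins, as in (3)."""
    return falling(row_total, r) * falling(col_total, r) / falling(n, r)

def expected_agreements(row_totals, col_totals):
    """E(A') from (2), using E(n_ij^2) = mu^[1] + mu^[2]."""
    n = sum(row_totals)
    margins = sum(a * a for a in row_totals) + sum(b * b for b in col_totals)
    cells = sum(mu(a, b, n, 1) + mu(a, b, n, 2)
                for a in row_totals for b in col_totals)
    return comb(n, 2) - margins / 2 + cells

# Margins of the worked example: every row and column total is 5, N = 15.
print(round(expected_agreements([5, 5, 5], [5, 5, 5]), 3))  # 62.143
```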

Finding the Variance of A'

We now develop the exact variance for the agreement statistic.
Although the formulae to follow appear tedious, they are
straightforward to apply. The variance of A' can be expressed generally as:

(4)  Var(A') = \sum_{i}^{R} \sum_{j}^{C} Var(n_{ij}^2) + \sum_{i}^{R} \sum_{j}^{C} \sum_{k}^{R} \sum_{l}^{C} Cov(n_{ij}^2, n_{kl}^2),  (i,j) \neq (k,l),

where the subscripts k and l are alternative row and column
subscripts respectively. To find the first term in (4), we take:

(5)  Var(n_{ij}^2) = E(n_{ij}^4) - [E(n_{ij}^2)]^2

where, in terms of the notation introduced in (3),

(6a)  E(n_{ij}^4) = \mu_{i,j}^{[4]} + 6\mu_{i,j}^{[3]} + 7\mu_{i,j}^{[2]} + \mu_{i,j}^{[1]}

and

(6b)  E(n_{ij}^2) = \mu_{i,j}^{[2]} + \mu_{i,j}^{[1]}
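The coefficients 1, 6, 7, 1 in (6a) come from writing n^4 as a sum of falling factorials. The sketch below, with illustrative helper names `falling`, `mu`, and `var_cell_square`, evaluates (5) for a single cell under the fixed-margin assumption; it is a sketch of the computation, not the authors' code.

```python
def falling(n, r):
    """Falling factorial n^[r] = n (n-1) ... (n-r+1)."""
    out = 1
    for k in range(r):
        out *= n - k
    return out

def mu(row_total, col_total, n, r):
    """r-th factorial moment of a cell count under fixed margins."""
    return falling(row_total, r) * falling(col_total, r) / falling(n, r)

def var_cell_square(row_total, col_total, n):
    """Var(n_ij^2) = E(n_ij^4) - [E(n_ij^2)]^2, using (6a) and (6b)."""
    m1, m2, m3, m4 = (mu(row_total, col_total, n, r) for r in (1, 2, 3, 4))
    e_n4 = m4 + 6 * m3 + 7 * m2 + m1   # (6a)
    e_n2 = m2 + m1                     # (6b)
    return e_n4 - e_n2 ** 2

# One cell of the worked example: row total 5, column total 5, N = 15.
print(round(var_cell_square(5, 5, 15), 3))  # 10.597
```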



Finding the second term of (4) requires breaking the overall
covariance into three parts. The three parts are:

(7a)  Cov(n_{ij}^2, n_{il}^2) = E(n_{ij}^2 n_{il}^2) - E(n_{ij}^2) E(n_{il}^2), where j \neq l;

(7b)  Cov(n_{ij}^2, n_{kj}^2) = E(n_{ij}^2 n_{kj}^2) - E(n_{ij}^2) E(n_{kj}^2), where i \neq k; and

(7c)  Cov(n_{ij}^2, n_{kl}^2) = E(n_{ij}^2 n_{kl}^2) - E(n_{ij}^2) E(n_{kl}^2), where i \neq k and j \neq l.

The factors in the last term in (7a, b, and c) can be found using
the format given in (6b) above. The computing formulae for the
first terms in (7a, b, and c) are given in Table 4.
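Table 4 is not reproduced in this text, so the following Python sketch rests on a stated assumption: it uses the standard mixed factorial moments of a fixed-margin table, in which a row or column shared by both cells contributes a single falling factorial of order a + b. The function names are illustrative. Under that assumption the sketch reproduces the covariance values used in the worked example.

```python
def falling(n, r):
    """Falling factorial n^[r] = n (n-1) ... (n-r+1)."""
    out = 1
    for k in range(r):
        out *= n - k
    return out

def mixed_moment(rows, cols, cell1, a, cell2, b):
    """E[n_ij^[a] n_kl^[b]] for two distinct cells (assumed fixed-margin model)."""
    (i, j), (k, l) = cell1, cell2
    n = sum(rows)
    row_part = (falling(rows[i], a + b) if i == k
                else falling(rows[i], a) * falling(rows[k], b))
    col_part = (falling(cols[j], a + b) if j == l
                else falling(cols[j], a) * falling(cols[l], b))
    return row_part * col_part / falling(n, a + b)

def expected_square(rows, cols, cell):
    """E(n_ij^2) = mu^[2] + mu^[1], as in (6b)."""
    i, j = cell
    n = sum(rows)
    return sum(falling(rows[i], r) * falling(cols[j], r) / falling(n, r)
               for r in (1, 2))

def cov_squares(rows, cols, cell1, cell2):
    """Cov(n_ij^2, n_kl^2) following the pattern of (7a), (7b), (7c)."""
    # n^2 = n^[2] + n^[1], so the product of squares expands into four terms.
    e_product = sum(mixed_moment(rows, cols, cell1, a, cell2, b)
                    for a in (1, 2) for b in (1, 2))
    return e_product - (expected_square(rows, cols, cell1)
                        * expected_square(rows, cols, cell2))

margins = [5, 5, 5]
print(round(cov_squares(margins, margins, (0, 0), (0, 1)), 3))  # same row: -4.513
print(round(cov_squares(margins, margins, (0, 0), (1, 1)), 3))  # different row and column: 2.431
```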



Table 4 about here 



An Approximate Test of Significance 

The null hypothesis of random agreement can be tested using
the statistic Z_{A'}:

(8)  Z_{A'} = \frac{ A' - E(A') }{ \sqrt{ Var(A') } }

which for large R, C, and N has a standard normal distribution.
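A minimal Python sketch of the test in (8) follows. The function names are illustrative assumptions, and the one-tailed probability is obtained from the standard normal distribution through the complementary error function; the inputs shown are the values from the worked example.

```python
from math import erfc, sqrt

def z_statistic(a_observed, a_expected, a_variance):
    """Z = (A' - E(A')) / sqrt(Var(A')), as in (8)."""
    return (a_observed - a_expected) / sqrt(a_variance)

def upper_tail_p(z):
    """One-tailed P(Z >= z) under the standard normal distribution."""
    return 0.5 * erfc(z / sqrt(2))

z = z_statistic(75.0, 62.143, 20.408)
print(round(z, 3))       # 2.846
print(upper_tail_p(z))   # well below 0.05
```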

An Empirical Investigation of the Normal Approximation

In theory, the approximation Z_{A'} given in (8) assumes large
R, C, and N. However, this type of normal approximation is
frequently used with very modest sample sizes. We therefore
undertook an empirical investigation to examine the validity of our
approximation for several small tables. Specifically, we chose
eight 3 x 3 tables, one each with an N of 15, 30, and 51, and five
with N's of 42. The results are presented in Table 5.



Table 5 about here

For each of the eight cases in Table 5, we generated by
computer the exact probability distribution of the test statistic A'.
Then, using the results of (2) and (4) above, we found the 0.05
and 0.01 cutoff values for A' in terms of the normal approximation
proposed in (8). Finally, we found what proportion of the area
under the exact distribution fell in the tail beyond these cutoff
points. These results are given in Table 5.
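One way such an exact distribution could be generated for a 3 x 3 table is to enumerate every table with the given margins and weight each by its fixed-margin (hypergeometric) probability. The Python sketch below is an assumption about the procedure, not the authors' program; all function names are illustrative, and treating "beyond" as "greater than or equal to" in `upper_tail` is likewise an assumption.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb, factorial

def tables_3x3(rows, cols):
    """Yield every 3 x 3 table of non-negative counts with the given margins."""
    for a in range(min(rows[0], cols[0]) + 1):
        for b in range(min(rows[0] - a, cols[1]) + 1):
            c = rows[0] - a - b
            if c > cols[2]:
                continue
            for d in range(min(rows[1], cols[0] - a) + 1):
                for e in range(min(rows[1] - d, cols[1] - b) + 1):
                    f = rows[1] - d - e
                    g, h, i = cols[0] - a - d, cols[1] - b - e, cols[2] - c - f
                    if i >= 0:  # g and h are non-negative by construction
                        yield ((a, b, c), (d, e, f), (g, h, i))

def agreements(table):
    """A' from the closed form in (1)."""
    n = sum(map(sum, table))
    cells = sum(x * x for row in table for x in row)
    margins = (sum(sum(r) ** 2 for r in table)
               + sum(sum(col) ** 2 for col in zip(*table)))
    return comb(n, 2) + cells - margins // 2

def exact_distribution(rows, cols):
    """Map each attainable A' to its exact probability under fixed margins."""
    numerator = 1
    for m in list(rows) + list(cols):
        numerator *= factorial(m)
    n_factorial = factorial(sum(rows))
    dist = defaultdict(Fraction)
    for table in tables_3x3(rows, cols):
        denominator = n_factorial
        for row in table:
            for x in row:
                denominator *= factorial(x)
        dist[agreements(table)] += Fraction(numerator, denominator)
    return dict(dist)

def upper_tail(dist, cutoff):
    """Exact probability that A' is at least `cutoff`."""
    return sum(p for a, p in dist.items() if a >= cutoff)

dist = exact_distribution((5, 5, 5), (5, 5, 5))
mean = sum(a * p for a, p in dist.items())
variance = sum(a * a * p for a, p in dist.items()) - mean ** 2
print(float(mean), float(variance))
print(float(upper_tail(dist, 71)), float(upper_tail(dist, 75)))
```

The mean and variance of the enumerated distribution give an independent check on (2) and (4), and the two tail areas can be compared with the first row of Table 5.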

To illustrate how Table 5 was constructed, let us focus on
the first row. Here we have a table where each of the three rows
and each of the three columns has a marginal total of 5. The next
three columns of Table 5 reference the normal approximation at the
0.05 level of significance. Since E(A') = 62.14 and V(A') = 20.41,
and z = 1.65 at \alpha = 0.05, the estimated cutoff point for A' from
the normal approximation is 62.14 + 1.65 \sqrt{20.41} = 69.55. Since
the exact distribution of A' is discrete, we choose the next
highest value in the exact distribution. This value, 71.00, cuts
off the top 0.064 of the exact distribution of A'. Similarly,
for the 0.01 level of significance, the normal approximation
cutoff value of A' equals 72.66, which corresponds to a value of 75.00
in the exact distribution. The tail area in the exact distribution
beyond 75.00 is 0.016.

Three conclusions emerge from Table 5. First, for moderately
small sample sizes, and for tables of small dimensionality (i.e.,
R = C = 3), the normal approximation to the distribution of A' is
consistently quite good. Second, for tables with constant R = C = 3,
increasing the sample size N has no dramatic effect on the quality
of the normal approximation. This confirms the results expected
from asymptotic normal theory (Wilks, 1962), which are that in
contingency tables such as ours, the validity of the normal
approximation is more affected by R and C than by N. Third, we investigated
the effects of asymmetry in the marginal totals for the five tables
with N = 42. These results are given in cases 3 through 7, which
indicate that the quality of the normal approximation is essentially
unaffected by different degrees of skewness in the marginals.
Overall, then, for any but the smallest tables, the normal approximation
should provide a reasonable guide for a test of the null hypothesis
of random agreement.



A Worked Example

In this concluding section, we illustrate the computations
for measuring and testing agreement using the raw data from the
children in Table 1. We begin by placing the raw data in the
3 x 3 contingency table given in Table 6. Notice that all the
marginal totals n_{i+} and n_{+j} are 5, which follows from the data in
Table 1.
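The step from the raw lists to a contingency table is a cross-tabulation. A minimal Python sketch follows, with the illustrative name `crosstab` and category labels abbreviated from Table 1. Categories are sorted alphabetically here, so the column order differs from Table 6; permuting rows or columns does not change A'.

```python
from collections import Counter

def crosstab(labels1, labels2):
    """Cross-tabulate two label lists into sorted row/column names and counts."""
    row_names = sorted(set(labels1))
    col_names = sorted(set(labels2))
    counts = Counter(zip(labels1, labels2))
    table = [[counts[(r, c)] for c in col_names] for r in row_names]
    return row_names, col_names, table

# Table 1 assignments for children A through O.
children = "ABCDEFGHIJKLMNO"
first = dict(zip(children,
                 ["Athletics"] * 5 + ["Popularity"] * 5 + ["Scholarship"] * 5))
second = {**dict.fromkeys("ABCDF", "Popularity"),
          **dict.fromkeys("GKLMN", "Athletics"),
          **dict.fromkeys("EHIJO", "None")}
row_names, col_names, table = crosstab([first[c] for c in children],
                                       [second[c] for c in children])

print(col_names)  # ['Athletics', 'None', 'Popularity']
for name, row in zip(row_names, table):
    print(name, row, sum(row))
```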



Table 6 about here

Computing A' from (1):

A' = 105 + (16 + 0 + 1 + 1 + 1 + 9 + 0 + 16 + 1) - 75 = 75.

Computing E(A') from (2) requires us to find first \mu_{i,j}^{[1]} and \mu_{i,j}^{[2]}:

\mu_{i,j}^{[1]} = \frac{5 \cdot 5}{15} = 1.667 for all i,j;

and

\mu_{i,j}^{[2]} = \frac{5 \cdot 4 \cdot 5 \cdot 4}{15 \cdot 14} = 1.905 for all i,j.

Now, finding E(A'):

E(A') = 105 - 75 + 9(1.667 + 1.905) = 62.143.
We now perform the slightly lengthier calculation of the
variance of A' from (4). First, we need to find \mu_{i,j}^{[3]} and \mu_{i,j}^{[4]}:

\mu_{i,j}^{[3]} = 1.319 for all i,j;

and

\mu_{i,j}^{[4]} = 0.440 for all i,j.

Then, finding other needed terms:

E(n_{ij}^4) = 0.440 + 6(1.319) + 7(1.905) + 1.667 = 23.352

E(n_{ij}^2) = (1.667 + 1.905) = 3.571

Var(n_{ij}^2) = 23.352 - (3.571)^2 = 10.597

\sum_{i} \sum_{j} Var(n_{ij}^2) = 9(10.597) = 95.369



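From the first two computational formulae in Table 4:

E(n_{ij}^2 n_{il}^2) = E(n_{ij}^2 n_{kj}^2) = \frac{5 \cdot 4 \cdot 5 \cdot 5}{15 \cdot 14} + \frac{5 \cdot 4 \cdot 3 \cdot 5 \cdot 5 \cdot 4}{15 \cdot 14 \cdot 13}
    + \frac{5 \cdot 4 \cdot 3 \cdot 5 \cdot 4 \cdot 5}{15 \cdot 14 \cdot 13} + \frac{5 \cdot 4 \cdot 3 \cdot 2 \cdot 5 \cdot 4 \cdot 5 \cdot 4}{15 \cdot 14 \cdot 13 \cdot 12} = 8.242

and therefore, from formulae (7a) and (7b):

Cov(n_{ij}^2, n_{il}^2) = Cov(n_{ij}^2, n_{kj}^2) = 8.242 - (3.571)^2 = -4.513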



Further, from the third formula in Table 4:

E(n_{ij}^2 n_{kl}^2) = \frac{5 \cdot 5 \cdot 5 \cdot 5}{15 \cdot 14} + \frac{5 \cdot 5 \cdot 5 \cdot 4 \cdot 5 \cdot 4}{15 \cdot 14 \cdot 13} + \frac{5 \cdot 4 \cdot 5 \cdot 4 \cdot 5 \cdot 5}{15 \cdot 14 \cdot 13}
    + \frac{5 \cdot 4 \cdot 5 \cdot 4 \cdot 5 \cdot 4 \cdot 5 \cdot 4}{15 \cdot 14 \cdot 13 \cdot 12} = 15.186

and from (7c):

Cov(n_{ij}^2, n_{kl}^2) = 15.186 - (3.571)^2 = 2.431.

So, summing up all the covariance terms:

\sum_{i} \sum_{j} \sum_{k} \sum_{l} Cov(n_{ij}^2, n_{kl}^2) = 36(-4.513) + 36(2.431) = -74.961

Finally, we find from (4) the variance of A':

Var(A') = 95.369 - 74.961 = 20.408.

These results permit us to test the null hypothesis of random
agreement for our example. Using the test in (8):

Z_{A'} = \frac{75.000 - 62.143}{\sqrt{20.408}} = 2.846.

Since the computed value of Z_{A'} far exceeds the critical value of
1.65 for the 0.05 level of significance, we reject the null hypothesis
for the data in Table 6, and conclude that agreement beyond chance
exists between the two psychologists.

Generalizability of the Agreement Statistic

While the development in this paper has focused entirely on
the problem of measuring agreement between two observers who
categorize N people, the statistic A' has substantially broader
application. We suggest four such applications here.

First, while our worked example involved placing people (young
children) into categories, the procedure is applicable for any
phenomena that can be nominally scaled. Thus, for example, two
psychologists may have a list of N behaviors to be classified
according to some unspecified set of psychological disorders.
Alternatively, two reading specialists may be asked to classify
a set of N words into subgroups according to common student
misconceptions in meaning. Notice that in these two examples it was
not people but rather phenomena that were being categorized.

Second, although in our worked example each dimension of the
table represented one observer, there is no reason why the
categorizations and the question of their agreement could not
emerge from two groups working independently.

Third, a recurring issue in applied research involves
comparing data from different studies (Light and Smith, 1971). For
example, suppose two states independently arrive at different
classification schemes for the same set of job titles. If one is
interested in the extent to which these two schemes are consistent,
or "agree", the measure A' and its test statistic are applicable.

Fourth, while in our worked example R = C = 3, it will
frequently occur that R \neq C. Therefore, the formulas for measuring
agreement are not restricted to square tables.

Table 1

Data on Two Psychologists' Categorization of 15 Children

                    Psychologist 1 Subgroups

Primarily            Primarily            Primarily
Interested in        Interested in        Interested in
Athletics            Popularity           Scholarship

Adam     (A)         Francis  (F)         Kathy    (K)
Bonnie   (B)         George   (G)         Larry    (L)
Claire   (C)         Harold   (H)         Michael  (M)
David    (D)         Ira      (I)         Nora     (N)
Edward   (E)         Jennifer (J)         Oscar    (O)

                    Psychologist 2 Subgroups

Primarily            Primarily            No Clear
Interested in        Interested in        Interests
Popularity           Athletics

Adam     (A)         George   (G)         Edward   (E)
Bonnie   (B)         Kathy    (K)         Harold   (H)
Claire   (C)         Larry    (L)         Ira      (I)
David    (D)         Michael  (M)         Jennifer (J)
Francis  (F)         Nora     (N)         Oscar    (O)


Table 2

Generalized R x C Contingency Table Format
for Agreement Problem

                              Observer 2

Observer 1    b_1      b_2      b_j      ...      b_C      Total

a_1           n_11     n_12     n_1j     ...      n_1C     n_1+
a_2           n_21     n_22     n_2j     ...      n_2C     n_2+
a_i           n_i1     n_i2     n_ij     ...      n_iC     n_i+
 .             .        .        .                 .        .
 .             .        .        .                 .        .
a_R           n_R1     n_R2     n_Rj     ...      n_RC     n_R+

Total         n_+1     n_+2     n_+j     ...      n_+C     N

Note: Using this format,

a) Rows and/or columns may be permuted with
   no loss of information.

b) It is not true that cells on the main
   diagonal represent agreement and cells off the
   main diagonal represent disagreement.


Table 3

Definitions of Agreement and Disagreement

                                   Observer 2
Observer 1
Classification           Same              Different
of Any Pair              Category          Categories

Same Category            Agreement         Disagreement
Different Categories     Disagreement      Agreement


Table 6

Data from Table 1 Reformulated in a Contingency Table
Format to Measure Agreement

                               Observer 2

Observer 1      Popularity      Athletics       None         Total

Athletics       A, B, C, D      -               E              5
                (4)             (0)             (1)

Popularity      F               G               H, I, J        5
                (1)             (1)             (3)

Scholarship     -               K, L, M, N      O              5
                (0)             (4)             (1)

Total           5               5               5             15